THE SPECTRUM OF RANDOM KERNEL MATRICES: 
UNIVERSALITY RESULTS FOR ROUGH AND VARYING KERNELS

YEN DO AND VAN VU 



Abstract. We consider random matrices whose entries are $f(X_i^T X_j)$ or $f(\|X_i - X_j\|^2)$ for iid vectors $X_i \in \mathbb{R}^p$ with normalized distribution. Assuming that $f$ is sufficiently smooth and the distribution of the $X_i$'s is sufficiently nice, El Karoui [17] showed that the spectral distributions of these matrices behave as if $f$ is linear in the Marchenko-Pastur limit. When the $X_i$'s are Gaussian vectors, variants of this phenomenon were recently proved for varying kernels, i.e. when $f$ may depend on $p$, by Cheng-Singer [13]. Two results are shown in this paper: first, it is shown that for a large class of distributions the regularity assumptions on $f$ in El Karoui's results can be reduced to a minimum; and secondly, it is shown that the Gaussian assumptions in Cheng-Singer's result can be removed, answering a question posed in [13] about the universality of the limiting spectral distribution.



1. Introduction 

Let $X_1, \dots, X_n \in \mathbb{R}^p$ be iid random vectors with normalization $\mathbb{E}[X_i] = 0$ and $\mathbb{E}\|X_i\|^2 = 1$; here $\|\cdot\|$ denotes the Euclidean length in $\mathbb{R}^p$. For any $F : \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R} \to \mathbb{R}$ symmetric in the first two variables, consider the random matrix $A$ with entries

(1.1) $A_{ij} = F(X_i, X_j, p)$,

or the variant with zeros on the diagonal

(1.2) $A_{ij} = F(X_i, X_j, p)$ if $i \ne j$, and $A_{ij} = 0$ if $i = j$.

Following previous literature [17], [13], in this paper these matrices will be referred to as random kernel matrices generated by $F$ and the distribution of the $X_i$'s. As described in [17], practical examples of $F$ are of the form

$F(X,Y,p) = f(X^T Y, p)$ or $f(\|X - Y\|^2, p)$.

More generally, one could have $F(X,Y,p) = f(g(X,Y),p)$ for some symmetric $g : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$; some lemmas in this paper are stated in this setting under suitable normalizing assumptions on $g$ (relative to the $X_i$'s). For convenience, $g$ will be referred to as the kernel and $f$ will be referred to as the envelope that generate $A$. Examples of envelope functions are $f(x) = \exp(x^a)$, $f(x) = (1 + x)^a$, where $a$ is fixed; others can be found in Rasmussen-Williams [30] and Williams-Seeger [39].

We will be interested in the weak limit of the empirical distribution $\mu_A$ of $A$ when $p, n \to \infty$ such that $p/n \to \gamma \in (0, \infty)$, a fixed constant. Recall that

$\mu_A = \frac{1}{n} \sum_{i=1}^{n} \delta_{\lambda_i}$;

here $\lambda_1, \dots, \lambda_n$ are the eigenvalues of $A$ and $\delta_\lambda$ denotes the counting measure at $\lambda$. This
research direction has been investigated recently by El Karoui [17] and Cheng-Singer [13], motivated by studies from machine learning and statistical analysis. In [17], it was assumed that $A$ is generated by either the inner-product or the distance kernels, and with $p$-independent envelope functions (which is the natural setting relative to the above normalization of $X_i$). It was shown in [17] that for $f$ sufficiently smooth the limiting behavior of $\mu_A$ depends only on a linear component of $f$. It turns out that a variant of this phenomenon continues to hold even if $f$ depends on $p$: for $g(X, Y) = X^T Y$ this was proved for Gaussian random vectors in a recent result of Cheng and Singer [13]. See also Bordenave [5] for a related recent work in this direction that appeared after an initial circulation of a first draft of this paper.
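To make the two models concrete, here is a minimal NumPy sketch of how the matrices (1.1) and (1.2) and the atoms of the empirical distribution could be formed for one sample. The helper names `kernel_matrix` and `sample_vectors`, the Gaussian choice of distribution, and the sizes are assumptions made for illustration only; they do not come from the paper.

```python
import numpy as np

def kernel_matrix(X, f, kind="inner", zero_diagonal=False):
    """Kernel matrix built from the rows X_1, ..., X_n of X (shape n x p).

    kind="inner"    gives entries f(X_i^T X_j)
    kind="distance" gives entries f(||X_i - X_j||^2)
    zero_diagonal=False corresponds to model (1.1), True to model (1.2).
    """
    gram = X @ X.T
    if kind == "inner":
        g = gram
    elif kind == "distance":
        sq_norms = np.diag(gram)
        g = sq_norms[:, None] + sq_norms[None, :] - 2.0 * gram
        np.fill_diagonal(g, 0.0)  # remove floating-point residue on the diagonal
    else:
        raise ValueError("kind must be 'inner' or 'distance'")
    A = np.array(f(g), dtype=float)  # always a fresh array
    if zero_diagonal:
        np.fill_diagonal(A, 0.0)
    return A

def sample_vectors(n, p, rng):
    # Illustrative assumption: Gaussian entries with variance 1/p,
    # so that E[X_i] = 0 and E||X_i||^2 = 1.
    return rng.standard_normal((n, p)) / np.sqrt(p)

rng = np.random.default_rng(0)
n, p = 200, 100  # p/n = 0.5 plays the role of gamma
X = sample_vectors(n, p, rng)
A = kernel_matrix(X, np.exp, kind="inner", zero_diagonal=True)
eigenvalues = np.linalg.eigvalsh(A)  # atoms of mu_A, each carrying mass 1/n
```

A histogram of `eigenvalues` then approximates the density of $\mu_A$ for this single sample.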

The goal of this paper is to extend the results in [17], [13] to more general settings. In particular, Theorem 3 will (positively) answer a question by Cheng and Singer [13] regarding the universality of the limiting spectral distribution of Cheng-Singer's models.

We would like to point out that El Karoui [TS] also considered a related model where the entries $g(X_i, X_j)$ are affected by random noise before the envelope $f$ is applied outside; the reader is referred to the beautiful work [T^ for further details. There is also a vast amount of literature concerning limiting behaviors of $\mu_A$ when $p$ is low or fixed; the interested reader is referred to [Ml ES] EH [23l lU El [22] and references therein.

For clarity, the discussion of previous and new results below is divided into two sections.

1.1. The p-independent setting. In this section, the setting when $F$ is independent of $p$ (relative to the above normalization of $X_i$) will be discussed. In other words, only the settings when $F(X,Y,p) = f(X^T Y)$ or $f(\|X - Y\|^2)$ (for some $p$-independent envelope function $f$) will be considered. Since the vectors $X_i$ are normalized, this is the natural setting for $f$.

1.1.1. The inner product kernel. Let $F(X,Y,p) = f(X^T Y)$. When the limiting spectral distribution for the model (1.1) of $A$ is known, standard arguments may be used to deduce the limiting spectral distribution of (1.2) (see e.g. [S] or [17], see also Lemma 2 of the current paper), and vice versa. Below the model (1.1) will be assumed unless otherwise stated.

For linear envelope functions, it is well-known that if the distribution of the vectors $X_i$ satisfies certain martingale/concentration properties then $\mu_A$ converges weakly to some form of the Marchenko-Pastur (MP) distribution, whose density is given by

This convergence was first established by Marchenko-Pastur [25] (see also Wachter [38]) when the entries of each vector $X_i$ are iid. Various authors have since extended this result to more general settings, see e.g. Aubrun [5], Yin and Krishnaiah [40], Silverstein [31], Götze and Tikhomirov [19], [20], El Karoui [HI [17], Adamczak [1], Pajor and Pastur [28], Bordenave et al. [7], Chafaï [10], Chatterjee et al. [12], and O'Rourke [27]. In particular, the result holds for $X_i$'s drawn independently from isotropic log-concave distributions; this is a result of Pajor and Pastur [28]. Extensions to settings with some martingale-type assumptions were carried out in [TOl [ini [I], and extensions to settings with some concentration conditions on the distributions of the $X_i$'s were done in [15]. See also [27] [T] [121 dH] for other generalizations.

For nonlinear envelope functions with sufficient smoothness, it was shown by El Karoui that if the distribution of the $X_i$'s is sufficiently nice then $A$ has the same limiting spectral distribution as

$B = [f(1) - f(0) - f'(0)] I_n + f'(0)\,(X_i^T X_j)_{i,j}$.

Here and in the rest of the paper, $I_n$ is the $n \times n$ identity matrix. Let $a := f(1) - f(0) - f'(0)$. Using the linear theory, it follows that

(1.3) $\lim \mu_A$ is the MP distribution rescaled by $f'(0)$ and shifted by $a$, if $f'(0) \ne 0$; and $\lim \mu_A = \delta_a$, if $f'(0) = 0$.



In El Karoui [17], the convergence (1.3) was considered in two different settings:

(i) The iid setting with $K$ moment bounds: Assume that

(1.4) the entries of $X_i$ are iid with $\mathbb{E}|\sqrt{p}\,X_{ij}|^K = O(1)$.

In this setting, it was shown in [17] that (1.3) holds if $K > 4$ and $f$ is sufficiently smooth near 0 and near 1.

(ii) The high concentration setting with parameter $c(p)$: Assume that

(1.5) for any 1-Lipschitz function $F$ there exist absolute constants $C, b > 0$ such that $P(|F(X_i) - m_F| > t) \le C \exp(-c(p)\,t^b)$ for all $t > 0$; here $m_F$ denotes a median of $F(X_i)$.

In this setting, it was shown in [17] that (1.3) holds if two conditions hold:

• $f$ is sufficiently smooth near 0 and near 1, and

• $c(p) \ge C p^{\epsilon + b/4}$ for some absolute constant $C > 0$. (For simplicity we'll write $c(p) \ge O(p^{\epsilon + b/4})$.)

In the iid setting (1.4), it was shown in [17] that a stronger convergence in spectral norm holds. In particular, it follows that some information about the largest eigenvalue of $B$ could be transferred to $A$. The interested reader is referred to [TT] [16] and the references therein for related literature.

The estimate (1.5) is satisfied for a large class of distributions, including:

(a) The $X_i$'s are Gaussian vectors (which is clearly a special case of (1.4)).

(b) The $X_i$'s are sampled from the unit sphere.

(c) The $X_i$'s are sampled from a distribution with log-concave density $e^{-U(x)}$ such that $\mathrm{Hess}\,U(x) - c(p)\,\mathrm{Id}$ is positive definite. (In this case $b = 2$, see e.g. [24].)

Other examples can be found in [171 IE] and [24].

In the special cases (a), (b) above, El Karoui's results were improved recently by Cheng-Singer [13], where the authors showed that similar results hold for the variant (1.2) under a weaker regularity assumption on $f$ near 0.

An initial examination of El Karoui's results reveals that one only needs differentiability of $f$ at 0 to formulate the above linear component $B$ of $A$. On closer inspection, perhaps continuity of $f$ at 1 is also required, since the diagonal entries of the covariance matrix of $X_i$ are converging to 1 in the large $n$, large $p$ limit; the exception is the zero-diagonal model (1.2).

In the first result of this paper, it will be shown that under these minimal regularity assumptions on $f$ the nonlinear-to-linear results of [17] can still be proved for a large class of distributions. Similar settings for the distribution of $X_i$ will be considered:

• The iid setting (1.4) with $K > 4$ moment bounds.

• The high concentration setting (1.5) with $c(p) \ge O(p^{b/2})$.

While our assumption $c(p) \ge O(p^{b/2})$ is stronger than El Karoui's assumption, it is in fact satisfied by a fairly large class of interesting distributions (see [24] and also [17] for many examples); also a recent work of Guedon-Milman [21] (cf. [29]) indicates that such a concentration assumption may be true for the isotropic log-concave setting (see also the discussion following Conjecture 1 for details). On the other hand, Theorem 1 requires only minimal regularity assumptions on $f$; one might view this as a trade-off between regularity of $f$ and concentration assumptions on the population distribution.

Theorem 1. Assume the iid setting (1.4) with $K > 4$, or the high concentration setting with $c(p) \ge O(p^{b/2})$.

(i) Let $f$ be differentiable at 0 and continuous at 1. Let $A$ be defined by (1.1) with $F(X,Y,p) = f(X^T Y)$. Then $A$ has the same limiting spectral distribution as

$B = [f(1) - f(0) - f'(0)] I_n + f'(0)\,(X_i^T X_j)_{i,j}$.

(ii) Let $f$ be differentiable at 0. Let $A$ be defined by (1.2) with $F(X,Y,p) = f(X^T Y)$. Then $A$ has the same limiting spectral distribution as

$B = [-f(0) - f'(0)] I_n + f'(0)\,(X_i^T X_j)_{i,j}$.

Theorem 1 will be shown in Section 3.
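As a numerical illustration of Theorem 1(i), the sketch below builds $A$ and its linear surrogate $B$ for one sample and compares bulk quantiles of their spectra. The Gaussian data, the envelope $f = \exp$ (so $f(0) = f'(0) = 1$ and $f(1) = e$), the sizes, and the helper name `linear_surrogate` are all assumptions chosen for illustration; at finite $n, p$ the two spectra agree only approximately, and the largest eigenvalue of $A$ is an outlier that does not affect the limit.

```python
import numpy as np

def linear_surrogate(X, f0, f1, fprime0):
    """B = [f(1) - f(0) - f'(0)] I_n + f'(0) (X_i^T X_j)_{i,j}."""
    n = X.shape[0]
    return (f1 - f0 - fprime0) * np.eye(n) + fprime0 * (X @ X.T)

rng = np.random.default_rng(1)
n, p = 400, 200
X = rng.standard_normal((n, p)) / np.sqrt(p)  # E||X_i||^2 = 1

A = np.exp(X @ X.T)                            # model (1.1) with f = exp
B = linear_surrogate(X, f0=1.0, f1=np.e, fprime0=1.0)

eig_A = np.linalg.eigvalsh(A)
eig_B = np.linalg.eigvalsh(B)

# Compare the bulk only, through a few quantiles of each spectrum.
levels = np.linspace(0.05, 0.95, 10)
bulk_gap = np.max(np.abs(np.quantile(eig_A, levels) - np.quantile(eig_B, levels)))
print("largest gap between bulk quantiles:", bulk_gap)
```

Increasing $n$ and $p$ at a fixed ratio is the natural way to watch the gap shrink.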

1.1.2. The distance kernel. Let $F(X,Y,p) = f(\|X - Y\|^2)$. This model has recently attracted the attention of some authors (see e.g. [HI [22l |8]), motivated by connections to machine learning theory and physics. In this model, it is clear that the two settings (1.1) and (1.2) are equivalent up to a shift (by $f(0)$) of the limiting spectral distribution. Below it will be assumed that $A$ is defined using (1.1).

When the $X_i$ are Bernoulli or sampled from the unit sphere, the distance kernel model follows from the inner product model, but it is not hard to find interesting examples (such as Gaussian or log-concave) where a naive adaptation of this reduction does not apply. This however suggests that $A$ should have the same limiting spectral distribution as

$B = [f(0) - f(2) + 2f'(2)] I_n - 2f'(2)\,(X_i^T X_j)_{i,j}$

when $f$ is sufficiently smooth and the distribution of $X_i$ is sufficiently concentrated. This was shown in El Karoui [17], where the author assumed essentially the same settings for the $X_i$'s as in the last section:

• In the iid setting (1.4) with $K > 5$ moment bounds, this was shown for $f$ being sufficiently smooth near 2.

• In the high concentration setting (1.5) with $c(p) \ge O(p^{\epsilon + b/4})$ this was shown for $f$ being sufficiently smooth near 2 and near 0.

In the iid setting, it was furthermore shown in [17] that a stronger convergence holds in spectral norm, which may lead to more information about the distribution of the largest eigenvalue of $A$. (The interested reader is referred to [17], [TS] and references therein for related literature.) As remarked earlier, the limiting spectral distribution of $B$ may be computed explicitly using Marchenko-Pastur theory.

It is clear that one only requires differentiability of $f$ at 2 to write down the above linear component $B$ of $A$. Theorem 2 below shows that for a large class of distributions of $X_i$, El Karoui's nonlinear-to-linear results for distance random matrices can be proved for $f$ assuming only this differentiability.

Theorem 2. Assume the iid setting (1.4) with $K > 4$, or the high concentration setting with $c(p) \ge O(p^{b/2})$.

Let $A$ be defined using (1.1) with $F(X,Y,p) = f(\|X - Y\|^2)$, where $f$ is differentiable at 2. Then $A$ has the same limiting spectral distribution as

$B = [f(0) - f(2) + 2f'(2)] I_n - 2f'(2)\,(X_i^T X_j)_{i,j}$.

Besides the regularity improvement, in the iid setting Theorem 2 requires fewer moment bounds than El Karoui [17]. As discussed before in the paragraph leading to the statement of Theorem 1, our assumption on $c(p)$ is stronger than that in [17], but is satisfied by a large class of interesting distributions, see [24] and [17] for many examples. A recent work of Guedon-Milman [21] (cf. [29]) indicates that such a concentration inequality may hold in the isotropic log-concave setting, see Conjecture 2 for details.

Theorem 2 will be proved in Section 3.
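The shape of the matrix $B$ in Theorem 2 can be motivated by a first-order Taylor expansion of $f$ at 2. The display below is only a heuristic sketch under the approximation $\|X_i\|^2 \approx 1$; it ignores the error terms that the actual proof has to control, and the symbol $\mathbf{1}$ (the all-ones vector in $\mathbb{R}^n$) is introduced here for illustration.

```latex
\begin{align*}
\|X_i - X_j\|^2 &= \|X_i\|^2 + \|X_j\|^2 - 2X_i^T X_j \approx 2 - 2X_i^T X_j
  && (i \ne j),\\
f(\|X_i - X_j\|^2) &\approx f(2) + f'(2)\bigl(\|X_i - X_j\|^2 - 2\bigr)
  \approx f(2) - 2f'(2)\,X_i^T X_j && (i \ne j),\\
A &\approx f(2)\,\mathbf{1}\mathbf{1}^T
  + \bigl[f(0) - f(2) + 2f'(2)\bigr] I_n - 2f'(2)\,(X_i^T X_j)_{i,j}.
\end{align*}
```

The coefficient of $I_n$ is chosen so that the diagonal entries come out as $f(0)$ when $\|X_i\|^2 \approx 1$. The rank-one term $f(2)\,\mathbf{1}\mathbf{1}^T$ does not affect the limiting spectral distribution, which leaves the matrix $B$.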

1.1.3. Conjectures. We conjecture that similar results hold in the isotropic log-concave case.

Conjecture 1. The results of Theorem 1 hold if the $X_i$'s are iid random vectors from a normalized isotropic log-concave distribution.

Conjecture 2. The results of Theorem 2 hold if the $X_i$'s are iid random vectors from a normalized isotropic log-concave distribution.

Recall the following result of Guedon-Milman [21] (cf. Paouris [29]):

Proposition 1. [21] Let $X$ be a (normalized) random vector sampled from an isotropic measure in $\mathbb{R}^p$ with log-concave density. Then for any $t > 0$,

$P(|\|X\| - 1| > t) \le C \exp(-c\,p^{1/2} \min(t, t^3))$

for some absolute constants $c, C \in (0, \infty)$.

It follows from the proofs of Theorem 1 and Theorem 2 that both conjectures hold if the estimate of Proposition 1 could be improved to $O(\exp(-c\,p^{1/2} t))$. Mainly, in these proofs the concentration bound (1.5) is only needed for the Lipschitz function $\|X\|$; here $X$ is either $X_i$ or $X_i + X_j$ (which also have an isotropic log-concave distribution). This leads to the question of whether one can remove the $t^3$ term in the Guedon-Milman concentration result.

See also Bordenave [8] for a recent work that was motivated by Conjecture 2.

1.1.4. Possible extensions of Theorem 1 and Theorem 2. Theorem 1 and Theorem 2 could be considered as extensions of Theorem 2.3 and Theorem 2.4 of El Karoui in [17] when the envelope/kernel functions are rougher. On the other hand, in [17] El Karoui considered a more general setting when the covariance matrix $\Sigma$ of $X_i$ is less restrictive (in the current paper we assume $\Sigma = \mathrm{Id}$, the identity operator in $\mathbb{R}^p$, which is the simplest but also most natural setting). More precisely, in [17] $\Sigma$ is allowed to depend on $p$, but is still positive definite and converges in some fashion to the identity in the limit $n, p \to \infty$. We anticipate that the proofs of Theorem 1 and Theorem 2 in this paper have a natural extension that could lead to an extension of these theorems to settings similar to those considered in Theorem 2.3 and 2.4 of [17]; however, these extensions are not explored in the current paper.

1.2. The p-dependent setting. In this section, a less classical setting recently investigated in [13] will be discussed. Here, the envelope function $f$ is allowed to depend on $p$ and may have very little regularity in $x$. In other words, $f$ may be varying with $p$. Examples of such situations and their motivations are presented in [13]. In this paper, only the inner-product kernel $g(X,Y) = X^T Y$ will be considered, and a similar investigation for the distance kernel is left for further study. Furthermore, following [13], only the non-diagonal model (1.2) will be considered, and analogous results for the diagonal model (1.1) may be obtained using a diagonal perturbation argument.

In this section, $X_1, X_2, \dots, X_n$ will be iid random vectors in $\mathbb{R}^p$ whose coordinates are independent copies of a random variable $Z$ with mean 0 and variance $1/p$, such that for all $K > 0$ there is a constant $C_K$ depending on $K$ such that

(1.6) $\mathbb{E}|Z|^K \le C_K\,p^{-K/2}$.

While (1.6) is required for all $K > 0$, this assumption may be improved if there are better bounds on the growth of a scaled version of $f$ as $p \to \infty$. For details, see the remark after the statement of Theorem 3.

Below, some standard facts about orthogonal polynomials will be recalled; for a standard reference see e.g. [32] or [5]. Given a nonnegative measure $\mu$ on $\mathbb{R}$ and $k = 0, 1, 2, \dots$, the $k$-th orthogonal polynomial $p_k(x)$ with respect to $\mu$ is a polynomial of degree $k$ with positive leading coefficient, such that

$\int_{\mathbb{R}} p_k(x)\,p_m(x)\,d\mu(x) = 0$ if $m \ne k$, and $= 1$ if $m = k$.

For any function $h \in L^2(d\mu)$ that belongs to the span of $\{p_n, n \ge 0\}$, one has the formal series

$\sum_{k=0}^{\infty} a_k\,p_k(x)$, where $a_k := \langle h, p_k \rangle = \int_{\mathbb{R}} h(x)\,p_k(x)\,d\mu(x)$,

and if the series converges to $h$ in $L^2(\mu)$ then the Plancherel equality holds: $\|h\|^2_{L^2(\mu)} = \sum_{k \ge 0} |a_k|^2$. In that case, since $p_0(x) = 1$ for probability distributions, it follows that if $\mu$ is the distribution of some random variable $\xi$ then

$\mathrm{Var}[h(\xi)] = \sum_{k \ge 1} |a_k|^2$.

For each $p$ let $\xi_p = \sqrt{p}\,X^T Y$ where $X$ and $Y$ are two iid copies of (any) vector $X_i$. Clearly $\xi_p$ has mean 0 and variance 1. Let $p_{k,p}(x)$, $k \ge 0$, be the orthogonal polynomials with respect to the probability distribution $\mu_p$ of $\xi_p$.
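For a concrete instance of the expansion and of the variance identity, one can take $\mu$ to be the standard Gaussian law, for which the orthonormal polynomials are the probabilists' Hermite polynomials divided by $\sqrt{k!}$. The sketch below is an illustration under that assumption only (the measures $\mu_p$ above are in general not Gaussian); the helper names and the test function $h(x) = x^2$ are hypothetical choices.

```python
import numpy as np
from math import factorial, sqrt
from numpy.polynomial.hermite_e import hermegauss, hermeval

def orthonormal_hermite(k, x):
    """k-th orthonormal polynomial for N(0,1): He_k(x) / sqrt(k!)."""
    coeffs = np.zeros(k + 1)
    coeffs[k] = 1.0
    return hermeval(x, coeffs) / sqrt(factorial(k))

def expansion_coefficients(h, max_degree, quad_points=40):
    """a_k = E[h(xi) p_k(xi)] for xi ~ N(0,1), by Gauss-Hermite quadrature."""
    nodes, weights = hermegauss(quad_points)      # weight function exp(-x^2/2)
    weights = weights / np.sqrt(2.0 * np.pi)      # normalise to a probability measure
    return np.array([
        np.sum(weights * h(nodes) * orthonormal_hermite(k, nodes))
        for k in range(max_degree + 1)
    ])

h = lambda x: x ** 2
a = expansion_coefficients(h, max_degree=4)

# Variance identity: Var[h(xi)] equals the sum of |a_k|^2 over k >= 1.
variance_from_coefficients = np.sum(a[1:] ** 2)
```

For $h(x) = x^2$ only $a_0$ and $a_2$ are nonzero, and the sum over $k \ge 1$ recovers $\mathrm{Var}[\xi^2] = 2$.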

Below we state the conditions that will be assumed on the envelope function $f(x,p)$ for the next result, Theorem 3. These conditions were first formulated in [13] in an equivalent form. For any $f(x,p)$ let $k(x,p) := \sqrt{p}\,f(x/\sqrt{p}, p)$ and consider the expansion

$k(x,p) = \sum_{k \ge 0} a_{k,p}\,p_{k,p}(x)$.

In this paper, $f$ is said to be admissible with respect to the generating distribution of $X_i$ if the following three conditions hold.

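(i) (Uniform convergence) The orthogonal polynomial series of $k(x,p)$ converges to $k(x,p)$ in $L^2(\mu_p)$ uniformly over $p$ large. In other words, for any $\epsilon > 0$ there exists $L = L(\epsilon)$ such that the following holds for $p$ large:

(1.7) $\bigl\|k(x,p) - \sum_{i \le L} a_{i,p}\,p_{i,p}(x)\bigr\|^2_{L^2(\mu_p)} = \sum_{i > L} |a_{i,p}|^2 \le \epsilon$.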

(ii) (Normalization) There exists $\nu \in [0, \infty)$ such that

(1.8) $\lim_{p \to \infty} \sum_{i=1}^{\infty} |a_{i,p}|^2 = \nu$.

(iii) (Scaling) There exists $a \in [0, \infty)$ such that

(1.9) $\lim_{p \to \infty} a_{1,p} = a$.



It is clear that (1.8) and (1.9) together imply the condition $a^2 \le \nu$, which shall be assumed throughout. It is worth pointing out that the set of orthogonal polynomials with respect to a probability measure $\mu$ does not always form a complete basis in $L^2(\mu)$; this however holds for a fairly large class of probability measures, including those with sub-exponential tails (i.e. $P(|\xi| > x) = O(e^{-c|x|})$ for some $c > 0$, see e.g. [2, Theorem 6.5.2]). In particular this completeness holds if the measure is compactly supported. It follows that the convergence of the orthogonal expansion in $L^2(\mu_p)$ holds automatically if the $X_i$'s are Gaussian or bounded (note that this gives convergence of the expansion for each $p$, and condition (1.7) is about the uniformity of the convergence). In the general case when completeness of the orthogonal polynomials is not guaranteed, the condition (1.7) has to be checked carefully (for both the convergence of the expansion for each $p$, and the uniformity of the convergence over $p$ large).

For the convenience of the reader, the definition of the Stieltjes transform $m(z)$ of a measure $\mu$ is recalled below:

$m(z) = \int_{\mathbb{R}} \frac{d\mu(x)}{x - z}$, $\mathrm{Im}(z) > 0$.

Theorem 3. Assume that $f$ is admissible with respect to $X_i$'s which satisfy (1.6). Let $A$ be generated using (1.2) with $F(X,Y,p) = f(X^T Y, p)$. Then the empirical distribution of $A$ converges weakly to a probability distribution whose Stieltjes transform $m(z)$ satisfies:

(1.10) $-\frac{1}{m(z)} = z + a\left(1 - \frac{1}{1 + \frac{a}{\gamma}\,m(z)}\right) + \frac{\nu - a^2}{\gamma}\,m(z)$.

Remark 1. As pointed out in [13], this limiting spectral distribution is no longer MP when $\nu \ne a^2$. Unique solvability of (1.10) was proved in [13] using elementary arguments.
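For readers who want to evaluate the limiting Stieltjes transform numerically, one possible approach is to rewrite the equation $-1/m = z + a\bigl(1 - 1/(1 + (a/\gamma)m)\bigr) + ((\nu - a^2)/\gamma)\,m$ as a fixed-point problem and iterate. This scheme is an assumption of this sketch, not a method taken from the paper or from [13]; it is only expected to behave well when $\mathrm{Im}(z)$ is not too small. The function names and parameter values are hypothetical.

```python
import numpy as np

def self_consistent_rhs(m, z, a, nu, gamma):
    """Right-hand side of the equation for -1/m(z)."""
    return z + a * (1.0 - 1.0 / (1.0 + (a / gamma) * m)) + ((nu - a ** 2) / gamma) * m

def solve_stieltjes(z, a, nu, gamma, iterations=500):
    """Plain fixed-point iteration m <- -1 / rhs(m), for Im(z) > 0."""
    m = -1.0 / z  # Stieltjes transform of a point mass at 0, used as a starting guess
    for _ in range(iterations):
        m = -1.0 / self_consistent_rhs(m, z, a, nu, gamma)
    return m

def residual(m, z, a, nu, gamma):
    """How far m is from solving the equation (zero for an exact solution)."""
    return -1.0 / m - self_consistent_rhs(m, z, a, nu, gamma)

# With a = 0 the equation reduces to -1/m = z + (nu/gamma) m, the semicircle case.
m_semicircle = solve_stieltjes(2.0j, a=0.0, nu=1.0, gamma=1.0)
m_general = solve_stieltjes(3.0j, a=1.0, nu=2.0, gamma=1.0)
```

Evaluating $\mathrm{Im}\,m(x + i\eta)/\pi$ for small $\eta > 0$ along a grid of real $x$ would give an approximation of the limiting density, although smaller $\eta$ may require damping or more iterations.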

The special case of Theorem 3 for Gaussian random vectors was proved by Cheng-Singer in [13]. Theorem 3 (positively) answers the question of Cheng and Singer in [13] about the validity of their result for more general distributions (such as Bernoulli).

While it is assumed in Theorem 3 that (1.6) holds for all $K > 0$, the method of proof can be easily refined to lessen this assumption when more information about $f$ is given. More precisely, if there is an upper bound $L$ on the cut-off degree $L(\epsilon)$ in (1.7) (independent of $\epsilon > 0$) then we only need (1.6) for $K$ up to $O_L(1)$: the degree $L$ will enter the proof in Lemma 9 and Lemma 8 and will eventually dictate the number of required moment bounds on entries of the $X_i$'s. For instance, when $f$ is independent of $p$ and has a non-vanishing derivative at $x = 0$ it can be verified that $\nu = a^2 = f'(0)^2$, and one can see from the proof of Theorem 1 that one could take $L = 1$ in (1.7). Eventually, with some refinements (tailored specifically for Theorem 1), this leads to the requirement $K > 4$ in Theorem 1.

Acknowledgement. We would like to thank X. Cheng and A. Singer for bringing this interesting subject to our attention and for many useful conversations. We would like to thank the referees for corrections and suggestions which have led to an improvement in the quality of the paper.

2. The general ideas 

Let $m_A(z)$ denote the Stieltjes transform of the empirical spectral distribution $\mu_A$ of $A$; in the following $m_A$ will be referred to as the Stieltjes transform of $A$. Explicitly,

$m_A(z) = \frac{1}{n} \mathrm{Tr}\,(A - z)^{-1}$, $\mathrm{Im}(z) > 0$.

By standard reductions (see e.g. [5]), it suffices to show that $m_A(z)$ converges to the Stieltjes transform of the desired limiting spectral distribution (which is always a probability distribution in the current paper) for $\mathrm{Im}(z) > 0$. For instance, in the setting of Theorem 3 it will be shown that $m_A$ converges to the solution of (1.10).
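The two descriptions of the Stieltjes transform of a matrix (an average over eigenvalues, or a normalized trace of the resolvent) can be checked against each other numerically. The following is a small sketch with hypothetical helper names, applied to an arbitrary random symmetric matrix rather than to a kernel matrix.

```python
import numpy as np

def stieltjes_from_eigenvalues(A, z):
    """m_A(z) as the average of 1 / (lambda_i - z) over the eigenvalues of A."""
    eigenvalues = np.linalg.eigvalsh(A)
    return np.mean(1.0 / (eigenvalues - z))

def stieltjes_from_resolvent(A, z):
    """m_A(z) as (1/n) Tr (A - z I_n)^{-1}."""
    n = A.shape[0]
    return np.trace(np.linalg.inv(A - z * np.eye(n))) / n

rng = np.random.default_rng(2)
M = rng.standard_normal((50, 50))
S = (M + M.T) / 2.0           # a real symmetric test matrix
z = 0.3 + 1.0j                # any point in the upper half plane

m_eig = stieltjes_from_eigenvalues(S, z)
m_res = stieltjes_from_resolvent(S, z)
```

The eigenvalue form is usually the cheaper one when $m_A$ is needed at many points $z$, since the eigenvalues are computed once.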

The main idea for showing the desired convergence of $m_A$ is to compare $A$ with a suitably chosen random matrix whose Stieltjes transform already has the desired convergence. In fact, due to a result from [13] which asserts that

(2.1) $\lim_{n \to \infty} |m_A(z) - \mathbb{E}\,m_A(z)| = 0$ a.s.,

it suffices to compare the expected values of the Stieltjes transforms in question. To keep the paper self-contained, a short proof of (2.1) will be included in Section 3.4 (see Lemma 3).

The proof of the results in the $p$-independent case will be presented in the next section. Following El Karoui [17], $A$ will be compared with a linear approximation of $A$, obtained by replacing the envelope function $f$ with the linear part of its Taylor expansion at a suitable point. The main idea which allows us to improve the regularity assumptions on $f$ in El Karoui's and Cheng-Singer's results is a simple transference principle; see Lemma 1 and also its companion Lemma 2 in Section 3 for details.

The proof of the results in the $p$-dependent setting will use a series of comparisons. In order to carry out the analysis of the main comparison, the Lindeberg swapping method will be used, following ideas from [34]. This method has recently proved useful in various studies of random matrices, especially for the local statistics (see [35] for a survey). One of the main difficulties in implementing the Lindeberg method in the current setting is the lack of regularity of the envelope function. To overcome this difficulty, the uniform convergence condition (1.7) will be used, as this condition allows for approximation of $f$ with polynomials (which are very smooth). The proof of Theorem 3 will be presented in Section 4.1 and Section lO

In the rest of the paper, without loss of generality it will be assumed that $\mathrm{Im}(z) > 0$. All implicit constants in the paper may depend on $z$. All asymptotic notations are used under the assumption that $p, n \to \infty$.

3. The p-independent setting

In this section, we prove Theorem 1 and Theorem 2.

Let $A$ be defined by (1.2) using $f(g(X,Y))$ and let $\tilde{A}$ be defined by (1.2) using $F(X,Y,p) = g(X,Y)$. The following transference principle asserts that one can deduce the limiting spectral density for $A$ from $\tilde{A}$ as long as:

(i) $f$ is differentiable at the mean value of $g(X_i, X_j)$; and

(ii) the entries of $\tilde{A}$ (hence the kernel $g$) satisfy a fairly general concentration condition (relative to the $X_i$'s).

Lemma 1 (Transference principle). Assume that $\mu_{\tilde{A}}$ converges weakly to a probability distribution. Let $a = \mathbb{E}\,g(X_i, X_j)$ and let $f$ be differentiable at $x = a$. Assume

(3.1) $\mathrm{Var}[g(X_i, X_j)] = O(1/p)$,

and assume that for any fixed $\delta > 0$ it holds that

(3.2) $P(\max_{i \ne j} |g(X_i, X_j) - a| > \delta) = o(1)$.

Then $A$ has the same limiting spectral distribution as

$B = (a f'(a) - f(a)) I_n + f'(a)\,\tilde{A}$.

Remarks: While different pairs $(f, g)$ may generate the same $F = f \circ g$, the two constraints (3.1) and (3.2) impose a strong normalization on $g$. Also, in Lemma 1 the spectral distribution of $\tilde{A}$ is not required to be Marchenko-Pastur.
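The algebra behind the matrix $B$ of Lemma 1 can be sanity-checked in the degenerate case where $f$ is exactly linear: the zero-diagonal matrix $A$ and the surrogate $B$ should then differ by a constant multiple of the all-ones matrix, which has rank one. The sketch below does this check; the helper names, the test matrix standing in for the kernel values $g(X_i, X_j)$, and the coefficients are hypothetical.

```python
import numpy as np

def off_diagonal(M):
    """Copy of M with its diagonal set to zero, as in model (1.2)."""
    out = np.array(M, dtype=float)
    np.fill_diagonal(out, 0.0)
    return out

def transference_surrogate(A_tilde, a, f_a, fprime_a):
    """B = (a f'(a) - f(a)) I_n + f'(a) A_tilde, with f(a) and f'(a) supplied."""
    n = A_tilde.shape[0]
    return (a * fprime_a - f_a) * np.eye(n) + fprime_a * A_tilde

rng = np.random.default_rng(3)
G = rng.standard_normal((6, 6))
G = (G + G.T) / 2.0                  # stand-in for the kernel values g(X_i, X_j)

c0, c1, a = 0.7, -1.3, 0.25          # linear envelope f(x) = c0 + c1 * x
A_tilde = off_diagonal(G)
A = off_diagonal(c0 + c1 * G)
B = transference_surrogate(A_tilde, a, f_a=c0 + c1 * a, fprime_a=c1)

difference = A - B                   # equals c0 times the all-ones matrix
```

For a genuinely nonlinear $f$ the difference is no longer exactly of rank one, and controlling the remainder is the content of the lemma.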

The following simple result will also be used, which says that under an assumption on concentration of $g(X_i, X_i)$, the models (1.1) and (1.2) are equivalent.

Lemma 2. Let $A_1$ and $A_2$ be defined by (1.1) and (1.2) respectively using $F(X,Y,p) = f(g(X,Y))$. Assume that $f$ is continuous at $b := \mathbb{E}\,g(X_i, X_i)$. Assume that for any $\delta > 0$ it holds that

(3.3) $P(\max_{1 \le i \le n} |g(X_i, X_i) - b| > \delta) = o(1)$.

Then in the large $n$, large $p$ limit it holds that

$|m_{A_1}(z + f(b)) - m_{A_2}(z)| = o(1)$ a.s.

Remark 2. If $g(X_i, X_i)$ is a constant then the above result is trivial, in which case continuity of $f$ at $b$ is not needed.

Using Lemma 2 the main argument is reduced to the non-diagonal model (1.2), where the transference principle could be used. Lemma 2 could be viewed as a companion of Lemma 1.

Below, Theorem 1 and Theorem 2 are deduced from the above two lemmas. Proofs of Lemma 1 and Lemma 2 are presented in Section 3.3.

3.1. Proof of Theorem 1. The iid case: Assume that the entries of $X_i$ are iid with $K > 4$ moment bounds.

Step 1: We first reduce the theorem to the model (1.2) of $A$. By Lemma 2 it suffices to show that: for any fixed $\delta > 0$ (i.e. independent of $n, p$) and any $i$ we have

(3.4) $P(|\|X_i\|^2 - 1| > \delta) = o(1/p)$.

We note that a more quantitative estimate was proved in [17] using more careful arguments; on the other hand, for (3.4) the following simplified argument suffices. Fix $\delta > 0$ and let $X = (x_1, \dots, x_p)$ be an independent copy of the $X_i$'s. Let $M := p^{\beta}$ for $0 > \beta > \frac{2}{K} - \frac{1}{2}$. Let $E = \{\max_j |x_j| > M\}$; clearly

$P(E) \le C \sum_j p^{-\beta K}\,\mathbb{E}|x_j|^K = o(1/p)$.

Let $\tilde{X} = (x_j 1_{|x_j| \le M})_{j=1}^{p}$. On $E^c$ clearly $\tilde{X} = X$. Thus, it suffices to show that

(3.5) $P(|\|\tilde{X}\|^2 - 1| > \delta) = o(1/p)$.

Let $\mathbf{1} = (1, \dots, 1) \in \mathbb{R}^p$. Let $\mu$ and $\sigma^2$ be the mean and variance of $x_j 1_{|x_j| \le M}$. It is not hard to see that

$\mu = o(\frac{1}{\sqrt{p}})$ and $\sigma^2 = \frac{1}{p} + o(\frac{1}{p})$.

For $p$ sufficiently large, it follows that

(3.6) $P(|\|\tilde{X}\|^2 - 1| > \delta) \le P(|\|\tilde{X} - \mu\mathbf{1}\|^2 - p\sigma^2| > \delta/2)$.

Write $\|\tilde{X} - \mu\mathbf{1}\|^2 - p\sigma^2 = \sum_j [(\tilde{x}_j - \mu)^2 - \sigma^2]$ as a sum of $p$ iid random variables, each of which has mean 0, is bounded above by $O(M^2) = O(p^{2\beta})$, and has the following variance bound:

$\lambda := \mathrm{Var}[(\tilde{x}_j - \mu)^2 - \sigma^2] = \mathbb{E}[(\tilde{x}_j - \mu)^4] - \sigma^4 = O(p^{-2})$.
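By Chernoff's inequality (see e.g. [33]), for $C_1, C_2$ absolute positive constants it holds that

$P(|\|\tilde{X} - \mu\mathbf{1}\|^2 - p\sigma^2| > \delta/2) \le$

(3.7) $\le C_1 \max\left(\exp\left(-\frac{C_2\,\delta^2}{p\lambda}\right), \exp\left(-\frac{C_2\,\delta}{M^2}\right)\right) = o(p^{-1})$,

where $\lambda$ is the variance bound and $M$ the truncation level introduced above.

Collecting inequalities (3.6) and (3.7), the desired estimate (3.5) follows.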

Step 2: Thanks to Step 1, it remains to show the theorem for $A$ given by (1.2). Let $\delta > 0$ be fixed. Using Lemma 1 it suffices to show that

(3.8) $\mathbb{E}[|X_i^T X_j|^K] = O(p^{-K/2})$, for some $K > 4$.

Write $X_i = (x_{i1}, \dots, x_{ip})$ and $X_j = (x_{j1}, \dots, x_{jp})$. Then $x_{ik}$ and $x_{jm}$ are independent with mean 0 and variance $1/p$ for any $1 \le k, m \le p$. By the inverse Khintchine inequality (i.e. the Marcinkiewicz-Zygmund inequality) it holds that

$\mathbb{E}|X_i^T X_j|^K \le C_K\,\mathbb{E}\left(\sum_{m=1}^{p} |x_{im} x_{jm}|^2\right)^{K/2}$

$\le C_K\,p^{K/2 - 1} \sum_{m=1}^{p} \mathbb{E}|x_{im}|^K\,\mathbb{E}|x_{jm}|^K$ (Hölder, then independence)

$= O(p^{-K/2})$ (using the given moment bounds).
The high concentration case: As in the iid case, it suffices to show (3.4) and (3.8). As can be seen below, in the proof it is enough to assume (1.5) for the 1-Lipschitz functions of the form $F(X) = \|X + c\|$, with $c \in \mathbb{R}^p$ constant vectors. See also the discussion after the statements of Conjecture 1 and Conjecture 2.

Proof of (3.4): Let $F(Y) = \|Y\|$, the Euclidean length of $Y \in \mathbb{R}^p$; clearly $F$ is 1-Lipschitz. Fix $i$ and $\delta > 0$; without loss of generality assume $\delta < 1/2$.

We first show that uniformly over $r > 0$ it holds that

(3.9) $P(|\|X_i\| - 1| > r) = O(e^{-c(p)\,r^b / C_b})$;

here and below $C_b$ will denote absolute constants that could depend on $b$. Let $a = \mathbb{E}\|X_i\|$. It suffices to show that $|a - 1| = O(c(p)^{-1/b})$. By (1.5), it holds that

$|a - m_F| \le \mathbb{E}|\|X_i\| - m_F| = O\left(\int_0^{\infty} e^{-c(p)\,r^b}\,dr\right) = O(c(p)^{-1/b})$.

Let $C$ be the implicit constant in the last estimate. Then for any $r > 2C\,c(p)^{-1/b}$ it holds that

$P(|\|X_i\| - a| > r) = O(e^{-c(p)(r/2)^b})$.

In this estimate, it is clear that if $r = O(c(p)^{-1/b})$ then $e^{-c(p)(r/2)^b} \sim 1$ while the left hand side is at most 1. Thus, the above estimate holds uniformly over $r > 0$. Now, (3.9) follows from

$|1 - a| \le \mathbb{E}\|X_i\|^2 - a^2 = \mathbb{E}|\|X_i\| - a|^2 = O(c(p)^{-2/b})$.

We obtain

(3.10) $P(|\|X_i\|^2 - 1| > r) = O(e^{-c(p)\,r^{b/2} / C_b})$ if $r > O(1)$, and $= O(e^{-c(p)\,r^b / C_b})$ if $r = O(1)$.

In particular, (3.4) follows.

Proof of (3.8): We first show that

(3.11) $P(|\|X_i + X_j\| - \mathbb{E}\|X_i + X_j\|| > 2r) = O(e^{-c(p)\,r^b})$, $i \ne j$,

uniformly over $r > 0$. For any $X \in \mathbb{R}^p$ let $G(X) = \mathbb{E}_Y \|X + Y\|$, the expectation over $Y$ independently sampled from the distribution of the $X_i$'s. It is clear that

$\mathbb{E}_{X_i} G(X_i) = \mathbb{E}\|X_i + X_j\|$.

Using independence of $X_i, X_j$, it follows that

LHS of (3.11) $\le \mathbb{E}\,1_{|\|X_i + X_j\| - G(X_i)| > r} + \mathbb{E}\,1_{|G(X_i) - \mathbb{E}_{X_i} G(X_i)| > r}$,

therefore (3.11) follows.

We now show that

(3.12) $P(|\|X_i + X_j\| - \sqrt{2}| > r) = O(e^{-c(p)\,r^b / C_b})$,

uniformly over $r > 0$. Let $a = \mathbb{E}\|X_i + X_j\|$. It suffices to show that $a = \sqrt{2} + O(c(p)^{-1/b})$. For any $K \ge 1$ we have

$\mathbb{E}|\|X_i + X_j\| - a|^K = \int_0^{\infty} K r^{K-1}\,P(|\|X_i + X_j\| - a| > r)\,dr = O(c(p)^{-K/b})$.

Letting $K = 2$ in the above estimate, it follows that $a = \sqrt{2} + O(c(p)^{-2/b})$.

It follows from (3.12) that

$P(|\|X_i + X_j\|^2 - 2| > r) = O(e^{-c(p)\,r^{b/2} / C_b})$ if $r > O(1)$, and $= O(e^{-c(p)\,r^b / C_b})$ if $r = O(1)$.

Combining this with (3.10), it follows that

(3.13) $P(|X_i^T X_j| > r) = O(e^{-c(p)\,r^{b/2} / C_b})$ if $r > O(1)$, and $= O(e^{-c(p)\,r^b / C_b})$ if $r = O(1)$.

Consequently, for any $K \ge 1$ it holds that

$\mathbb{E}[|X_i^T X_j|^K] = \int_0^{\infty} K r^{K-1}\,P(|X_i^T X_j| > r)\,dr = \int_0^1 K r^{K-1}\,P(|X_i^T X_j| > r)\,dr + \int_1^{\infty} K r^{K-1}\,P(|X_i^T X_j| > r)\,dr$

$= O(c(p)^{-K/b}) + O(c(p)^{-2K/b}) = O(p^{-K/2})$.

3.2. Proof of Theorem 2. For the distance model, the diagonal entries are $f(0)$, therefore removing/adding these entries does not require any regularity of $f$. The transference principle Lemma 1 will be used, and it remains to show that $g(X, Y) = \|X - Y\|^2$ satisfies the two kernel conditions of Lemma 1.

The iid case: We first verify (3.1). Let $X$ and $Y$ denote $X_i$ and $X_j$ for some $i \ne j$. Then

$\mathrm{Var}[\|X - Y\|^2] = -4 + \sum_{i=1}^{p} \mathbb{E}(x_i - y_i)^4 + \sum_{i \ne j} \mathbb{E}(x_i - y_i)^2 (x_j - y_j)^2$.

Using $\mathbb{E}x_i^4$ and $\mathbb{E}y_i^4 = O(1/p^2)$, it is clear that

$\sum_i \mathbb{E}(x_i - y_i)^4 = O(1/p)$.

Using also independence of $x_i, x_j, y_i, y_j$, we have

$\sum_{i \ne j} \mathbb{E}(x_i - y_i)^2 (x_j - y_j)^2 = 4 \sum_k \mathbb{E}x_k^2 \sum_k \mathbb{E}y_k^2 + O(1/p) = 4 + O(1/p)$,

and (3.1) follows.

We now verify (3.2). Fix any $\delta > 0$. Using the previously obtained bounds (3.4) and (3.8), it follows from the triangle inequality that

$P(\max_{i \ne j} |\|X_i - X_j\|^2 - 2| > \delta)$

$\le 2P(\max_i |\|X_i\|^2 - 1| > \delta/4) + P(\max_{i \ne j} |X_i^T X_j| > \delta/4)$

$= o(1)$,

as desired.

The high concentration case: Note that (3.2) follows from (3.10) and (3.13). In fact, for any $r > 0$ it holds that

$P(|\|X_i - X_j\|^2 - 2| > 4r) \le 2P(|\|X_i\|^2 - 1| > r) + P(|X_i^T X_j| > r)$

$= O(e^{-c(p)\,r^{b/2} / C})$ if $r > 2$, and $= O(e^{-c(p)\,r^b / C})$ if $r = O(1)$.

Via the same argument as before, we also obtain

$\mathbb{E}|\|X_i - X_j\|^2 - 2|^K = O(p^{-K/2})$,

for any $K \ge 2$, and taking $K = 2$ gives us the first kernel condition.

3.3. Proof of Lemma 2 and Lemma 1.

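Proof of Lemma 2. Using (2.1), it suffices to show that $\mathbb{E}|m_{A_1}(z + f(b)) - m_{A_2}(z)| =
O(\varepsilon)$ for any $\varepsilon > 0$, which is fixed in the rest of the proof.

Let $I_n$ denote the $n \times n$ identity matrix. By a standard argument,

$$ |m_{A_1}(z + f(b)) - m_{A_2}(z)| \le \big\|(A_1 - (z + f(b))I_n)^{-1} - (A_2 - zI_n)^{-1}\big\| $$
$$ \le C\,\|A_1 - f(b)I_n - A_2\| $$
$$ \le C \sup_{1 \le i \le n} |f(g(X_i, X_i)) - f(b)| , $$

here $\|\cdot\|$ denotes the spectral norm of a matrix.

Since $f$ is continuous at $b$, there exists $\delta = \delta(f, \varepsilon) > 0$ such that

$$ |f(x) - f(b)| < \varepsilon \quad \text{if} \quad |x - b| < \delta . $$

Since $m_{A_1}(z), m_{A_2}(z) = O(1)$, it follows that

$$ \mathbb{E}|m_{A_1}(z + f(b)) - m_{A_2}(z)| = O(\varepsilon) + O\Big(P\Big(\sup_{1 \le i \le n} |f(g(X_i, X_i)) - f(b)| > \delta\Big)\Big) . $$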

Therefore E|mAi(z + f{b)) ~ m^,(z)| = 0(e), thanks to □ 
Proof of Lemma 1. Without loss of generality we may assume that $a = 0$. Recall
that $A$ is defined using (1.2).

Let $h(x) = f(0) + f'(0)x$, and let $B$ be obtained from $A$ by replacing $f$ with $h$
(while keeping the same kernel $g$). More specifically,

$$ B = f(0)M_1 + f'(0)A - f(0)I_n , $$

here $I_n$ is the $n \times n$ identity matrix and $M_1$ is the $n \times n$ matrix whose entries are all
1's (in particular $M_1$ has rank 1 and thus does not contribute to the limiting spectral
distribution of $B$, see e.g. [5] or HI). Thus, it suffices to show that $m_A(z) - m_B(z) \to 0$
for any $z$ in the upper half plane. Using (2.1), this follows from

(3.14) $$ |\mathbb{E}m_A(z) - \mathbb{E}m_B(z)| \to 0 , $$

which will be shown in the rest of the proof.

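Fix $\varepsilon > 0$; it suffices to show that $\mathbb{E}|m_A(z) - m_B(z)| = O(\varepsilon)$ for $n$, $p$ sufficiently
large. Let $\lambda_1(A) \le \dots \le \lambda_n(A)$ be the eigenvalues of $A$ and $\lambda_1(B) \le \dots \le \lambda_n(B)$
be the eigenvalues of $B$. Then for any fixed $z \notin \mathbb{R}$ it holds that

$$ |m_A(z) - m_B(z)|^2 \le \frac{1}{n} \sum_{i=1}^{n} \Big| \frac{1}{\lambda_i(A) - z} - \frac{1}{\lambda_i(B) - z} \Big|^2 \quad \text{(Cauchy-Schwarz)} $$

(3.15) $$ \le \frac{C}{n} \sum_{i=1}^{n} |\lambda_i(A) - \lambda_i(B)|^2 \quad (\text{using } \mathrm{Im}(z) > 0) , $$

where $C > 0$ is a constant which may depend on $z$. It then follows from the
Hoffman-Wielandt inequality (see e.g. [53]) that

(3.16) $$ |m_A(z) - m_B(z)|^2 \le \frac{C}{n} \sum_{i,j} |A_{ij} - B_{ij}|^2 . $$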

By definition, there is $\delta > 0$ depending on $f$ only such that

(3.17) $$ |f(x) - h(x)| < \varepsilon |x| \quad \text{for } |x| < \delta . $$

Let $F$ be the event that there is a pair $i \ne j$ such that $|g(X_i, X_j)| > \delta$. It follows
from the second assumption (3.2) on $g$ that $P(F) < \varepsilon^2$ for $n$ (and $p$) sufficiently
large. We now estimate

$$ \mathbb{E}\big[|m_A(z) - m_B(z)|^2\big] \le \mathbb{E}\big[1_F |m_A(z) - m_B(z)|^2\big] + \mathbb{E}\big[1_{F^c} |m_A(z) - m_B(z)|^2\big] . $$

Since $m_A(z) = O(1)$ and $m_B(z) = O(1)$, it follows that

$$ \mathbb{E}\big[1_F |m_A(z) - m_B(z)|^2\big] = O(P(F)) = O(\varepsilon^2) $$

for large $n$, large $p$. On the other hand, from (3.16) and (3.17), it follows that

$$ \mathbb{E}\big[1_{F^c} |m_A(z) - m_B(z)|^2\big] = O\Big(\frac{1}{n} \sum_{i \ne j} \mathbb{E}\big(\varepsilon^2 |g(X_i, X_j)|^2\big)\Big) . $$

With the first assumption on $g$ it follows that

$$ \mathbb{E}\big[1_{F^c} |m_A(z) - m_B(z)|^2\big] = O(\varepsilon^2) . $$

Consequently, in the large $n$, large $p$ limit it holds that

$$ \mathbb{E}|m_A(z) - m_B(z)| \le \big(\mathbb{E}|m_A(z) - m_B(z)|^2\big)^{1/2} = O(\varepsilon) . $$

This completes the proof of (3.14).
□
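The chain (3.15), (3.16) is a deterministic inequality, so it can be tested on arbitrary symmetric matrices. The following is a minimal sketch with the explicit constant $C = \mathrm{Im}(z)^{-4}$, which is what the Cauchy-Schwarz and Hoffman-Wielandt steps give when tracked by hand. The test matrices are generic random symmetric matrices and are an assumption of this example, not the kernel matrices of the paper; `stieltjes` and `hoffman_wielandt_sides` are hypothetical helper names.

```python
import numpy as np

def stieltjes(A, z):
    """Stieltjes transform m_A(z) = (1/n) sum_i 1 / (lambda_i(A) - z) of a symmetric matrix."""
    eig = np.linalg.eigvalsh(A)
    return np.mean(1.0 / (eig - z))

def hoffman_wielandt_sides(n=40, z=0.3 + 0.5j, seed=0):
    """Return (lhs, rhs) of |m_A - m_B|^2 <= ||A - B||_F^2 / (n Im(z)^4)."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n, n))
    A = (G + G.T) / np.sqrt(2 * n)
    E = rng.standard_normal((n, n))
    B = A + 0.1 * (E + E.T) / np.sqrt(2 * n)   # a symmetric perturbation of A
    lhs = abs(stieltjes(A, z) - stieltjes(B, z)) ** 2
    rhs = np.sum((A - B) ** 2) / (n * z.imag ** 4)
    return lhs, rhs
```

The right-hand side is the averaged squared entrywise difference that the proof controls through (3.17) and the first kernel condition.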

3.4. Concentration of the Stieltjes transform. In this section, we include a
proof of (2.1).

Lemma 3. [13] Let $M$ be a $p \times n$ matrix with independent entries. Let $A$ be
defined by $A_{ij} = F(M_i, M_j, p)$ for any real-valued function $F$ that is symmetric in the
first two variables, here $M_i$ denotes the $i$-th column of $M$. Let $\tilde{A}$ have the same
non-diagonal entries as $A$, and zero diagonal entries. Fix $z$ with $\mathrm{Im}(z) > 0$. Then

(3.18) $$ \lim_{n \to \infty} |m_A(z) - \mathbb{E}m_A(z)| = 0 \quad \text{a.s.} $$

(3.19) $$ \lim_{n \to \infty} |m_{\tilde{A}}(z) - \mathbb{E}m_{\tilde{A}}(z)| = 0 \quad \text{a.s.} $$

Proof. The proof largely follows the argument in [13], which is a variant of standard
arguments (see e.g. [5]). Only (3.18) will be proved below; the proof for (3.19) is
entirely similar. Using Borel-Cantelli's lemma, it suffices to show

(3.20) $$ \mathbb{E}|m_A(z) - \mathbb{E}m_A(z)|^4 = O(n^{-2}) . $$

Recall the following standard estimate (i.e. Khintchine's inequality) for the
martingale square function:

(3.21) $$ \mathbb{E}\Big|\sum_{j=1}^{n} \Delta_j\Big|^p \le C_p\, \mathbb{E}\Big(\sum_{j=1}^{n} |\Delta_j|^2\Big)^{p/2} , \quad 1 < p < \infty , $$

where $\Delta_j = S_{j+1} - S_j$ is the martingale difference sequence. Below, a martingale will
be constructed such that $S_0 = n\mathbb{E}m_A(z)$ and $S_n = n m_A(z)$, and it will then be shown that the
corresponding right hand side of (3.21) is bounded above by a suitable power of $n$.

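For any $0 \le k \le n$ let $\mathcal{A}_k$ be the sigma algebra generated by the last $k$ columns
of $M$, and let $\mathbb{E}_k$ denote the conditional expectation given $\mathcal{A}_k$. Then define
$S_k := \mathbb{E}_k \mathrm{Tr}(A - z)^{-1}$, which is a martingale with respect to
the filtration $\{\mathcal{A}_0 \subset \dots \subset \mathcal{A}_n\}$. Since $m_A(z)$ is measurable with respect to $\mathcal{A}_n$,
the construction gives $S_n = n m_A(z)$ while clearly $S_0 = n\mathbb{E}m_A(z)$. It follows from
(3.21), with $p = 4$, that

$$ \mathbb{E}|m_A(z) - \mathbb{E}m_A(z)|^4 \le \frac{C}{n^2} \Big\| \max_{j} |\Delta_j| \Big\|_{L^\infty}^4 . $$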

It therefore suffices to show that, uniformly over $j$,

(3.22) $$ \mathbb{E}_j[\mathrm{Tr}(A - z)^{-1}] - \mathbb{E}_{j-1}[\mathrm{Tr}(A - z)^{-1}] = O(1) , $$

where the implicit constant is allowed to depend on $z$. Fix $1 \le j \le n$ and let $B$
denote the $(n - 1) \times (n - 1)$ submatrix of $A$, obtained by deleting the $j$-th row and
the $j$-th column of $A$. By definition of $A$, the entries of $B$ are independent of $M_j$,
therefore $\mathbb{E}_j B = \mathbb{E}_{j-1} B$ and consequently it suffices for (3.22) to show that

$$ |\mathrm{Tr}(A - z)^{-1} - \mathrm{Tr}(B - z)^{-1}| = O(1) . $$

Since $F$ is symmetric and real-valued, the eigenvalues of $A$ and $B$ are real valued.
Furthermore, they interlace by the Cauchy interlacing theorem (see e.g. [53]). The
desired estimate now follows immediately:

|T,.(^ - «)- - MB - «)-| < / . O(^) ^ 

□ 
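The last step of the proof of Lemma 3 can be illustrated numerically. The sketch below deletes one row and column of a symmetric matrix, checks Cauchy interlacing of the eigenvalues, and compares the two resolvent traces. The explicit bound $1/\mathrm{Im}(z)$ used in the check is a standard consequence of the Schur complement formula and is an assumption of this example; the text only needs an $O(1)$ bound depending on $z$. The function names and the random test matrix are hypothetical.

```python
import numpy as np

def resolvent_trace(A, z):
    """Tr (A - z)^{-1} computed from the real eigenvalues of a symmetric matrix."""
    return np.sum(1.0 / (np.linalg.eigvalsh(A) - z))

def delete_row_col(A, j):
    """Principal submatrix of A with the j-th row and j-th column removed."""
    return np.delete(np.delete(A, j, axis=0), j, axis=1)

def minor_comparison(n=25, j=4, z=0.1 + 0.3j, seed=0):
    """Return eigenvalues of A and of its minor B, and |Tr(A-z)^{-1} - Tr(B-z)^{-1}|."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n, n))
    A = (G + G.T) / np.sqrt(2 * n)
    B = delete_row_col(A, j)
    gap = abs(resolvent_trace(A, z) - resolvent_trace(B, z))
    return np.linalg.eigvalsh(A), np.linalg.eigvalsh(B), gap
```

Because the bound does not depend on $n$, each martingale difference in (3.22) is $O(1)$, which is what the fourth moment estimate (3.20) requires.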
4. The p-dependent setting

4.1. Some estimates for orthogonal polynomials. In this section, some basic
estimates involving orthogonal polynomials are proved, and these estimates will be
used in the proof of Theorem 3.

Let $h_k(x)$ denote the $k$-th Hermite polynomial, i.e. the orthogonal polynomial
with respect to the Gaussian measure $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx$.

Let $p_{k,p}$ denote the $k$-th orthogonal polynomial with respect to the probability
distribution of $\xi_p = \sqrt{p}\, Z_1^T Z_2$ where $Z_1$, $Z_2$ are iid random $1 \times p$ vectors whose
coordinates are independently sampled from a random variable $Z$ satisfying

$$ \mathbb{E}[Z] = 0 , \quad \mathrm{Var}[Z] = 1/p , $$

and for some $K > 0$ sufficiently large

(4.1) $$ \mathbb{E}[|Z|^K] = O(p^{-K/2}) . $$

Lemma 4. Let $k \ge 0$. If (4.1) holds for $K$ sufficiently large then for any $\delta > 0$ it
holds that

(4.2) $$ |p_{k,p}(x) - h_k(x)| \le C\delta(1 + |x|^k) , \quad x \in \mathbb{R} , $$

here the implicit constant $C$ may depend on $k$ (but not on $p$).

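Proof. For $k \ge 0$ let $m_k = \mathbb{E}\xi_p^k$. Then for some normalization constant $C_k$ it holds
that (see e.g. [32]):

$$ p_{k,p}(x) = C_k \det \begin{pmatrix} m_0 & m_1 & \dots & m_k \\ \vdots & \vdots & & \vdots \\ m_{k-1} & m_k & \dots & m_{2k-1} \\ 1 & x & \dots & x^k \end{pmatrix} . $$

The leading coefficient of $p_{k,p}$ is $C_k \det M_{k-1}$ where

$$ M_j = \begin{pmatrix} m_0 & \dots & m_j \\ \vdots & & \vdots \\ m_j & \dots & m_{2j} \end{pmatrix} . $$

Since $\|p_{k,p}\|_{L^2(\mu_p)} = 1$, it follows that

$$ C_k^2 = \frac{1}{(\det M_{k-1})(\det M_k)} $$

and the sign of $C_k$ is completely determined from the sign of $\det M_{k-1}$. Therefore
in order to show (4.2) it suffices to show that if $\delta > 0$ then

(4.3) $$ \mathbb{E}\xi_p^k = \mathbb{E}N^k + O_k(\delta) , $$

where $N$ is the normal Gaussian $\mathcal{N}(0,1)$. But this is a classical theorem of von
Bahr [37]. □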
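The determinant formula in the proof above depends on the underlying distribution only through its moments, which is why closeness of moments as in (4.3) transfers to closeness of the polynomials. The sketch below builds the orthonormal polynomials from a list of moments by cofactor expansion along the last row, and feeds it the exact moments of $\mathcal{N}(0,1)$. In that case the output should be the normalized (probabilists') Hermite polynomials $\mathrm{He}_k/\sqrt{k!}$; that normalization, and all function names, are assumptions of this illustration.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e

def gaussian_moment(j):
    """E N^j for N ~ N(0,1): zero for odd j, (j - 1)!! for even j."""
    if j % 2:
        return 0.0
    return float(math.prod(range(1, j, 2)))   # empty product is 1 when j = 0

def hankel(moments, size):
    """Hankel moment matrix (m_{a+b}) with 0 <= a, b < size."""
    return np.array([[moments[a + b] for b in range(size)] for a in range(size)])

def orthonormal_poly_coeffs(moments, k):
    """Coefficients (lowest degree first) of the k-th orthonormal polynomial,
    obtained from moments m_0, ..., m_{2k} via the determinant formula."""
    if k == 0:
        return np.array([1.0 / math.sqrt(moments[0])])
    # first k rows of the (k+1) x (k+1) matrix; the last row is (1, x, ..., x^k)
    top = np.array([[moments[a + b] for b in range(k + 1)] for a in range(k)])
    coeffs = np.array([(-1) ** (k + j) * np.linalg.det(np.delete(top, j, axis=1))
                       for j in range(k + 1)])
    d_prev = np.linalg.det(hankel(moments, k))       # det M_{k-1}
    d_curr = np.linalg.det(hankel(moments, k + 1))   # det M_k
    return coeffs / math.sqrt(d_prev * d_curr)
```

Replacing the Gaussian moments by slightly perturbed moments changes the coefficients continuously, which is the mechanism behind Lemma 4.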

Lemma 5. Let $a_{j,p}$ be the coefficients in the orthogonal expansion of a normalized
kernel function $k$ satisfying $\mathbb{E}[k(\xi_p, p)^2] = O(1)$ uniformly over $p$. Then

$$ |a_{j,p}| = O(1) . $$

Proof. Clearly for any $j$

$$ |a_{j,p}| = \Big| \int k(x,p)\, p_{j,p}(x)\, d\mu_p(x) \Big| \le \|k(\cdot,p)\|_{L^2(\mu_p)}\, \|p_{j,p}\|_{L^2(\mu_p)} = O(1) . $$

□

Note that if the kernel $k$ satisfies (1.8) then Lemma 5 applies.

4.2. Proof of Theorem 3. Let $M_A$ be the $p \times n$ matrix whose columns are
$X_1, \dots, X_n$. Let $M_G$ denote the $p \times n$ matrix whose columns are iid random Gaussian
vectors $G_1, \dots, G_n$, which are normalized so that the entries of $M_G$ have mean 0
and variance $1/p$. A Gaussian analogue $G$ of $A$ will be constructed as follows. The
construction will ensure, thanks to [13], that the spectral density of $G$ converges to
the desired limiting spectral density in Theorem 3.

Let $\xi_{G,p} = \sqrt{p}\, G_1^T G_2$ and let $\mu_{G,p}$ denote its probability distribution. Let

$$ K(x,p) = \sum_{i=0}^{\infty} a_{i,p} P_{i,p}(x) $$

where $P_{i,p}$ is the $i$-th orthogonal polynomial with respect to $\mu_{G,p}$. Note that the
above infinite sum converges uniformly in $L^2(\mu_{G,p})$ by given assumptions on $a_{i,p}$.

Let F{x,p) = {^)^^K{x^,p) and let G be the random matrix generated from 
$M_G$ using $F$. It was shown in [13] that the Stieltjes transform $m_G(z)$ of $G$ converges
pointwise to the solution of (1.10). Therefore, using (2.1), it suffices to show that
in the large $n$ large $p$ limit it holds that

(4.4) $$ \mathbb{E}[m_A(z) - m_G(z)] = O(\varepsilon) , $$

where $\varepsilon > 0$ is fixed in the rest of this section.

It follows from the given assumption (1.7) that there exists $L = L(\varepsilon)$ such that
uniformly over large $p$ it holds that $\sum_{i > L} |a_{i,p}|^2 = O(\varepsilon^2)$. Let

$$ k_L(x,p) := \sum_{h=0}^{L} a_{h,p}\, p_{h,p}(x) \quad \text{and} \quad K_L(x,p) := \sum_{h=0}^{L} a_{h,p}\, P_{h,p}(x) . $$

We obtain

(4.5) $$ \mathbb{E}\big[|k(\xi_p, p) - k_L(\xi_p, p)|^2\big] = O(\varepsilon^2) , $$

(4.6) $$ \mathbb{E}\big[|K(\xi_{G,p}, p) - K_L(\xi_{G,p}, p)|^2\big] = O(\varepsilon^2) . $$

Let $f_L$ and $F_L$ correspond to $k_L$ and $K_L$, and let $A_L$ and $G_L$ be generated from
them respectively.

It follows from (4.5) and (4.6) and Lemma 6 below that

$$ \mathbb{E}[m_A(z) - m_{A_L}(z)] = O(\varepsilon) , $$

$$ \mathbb{E}[m_G(z) - m_{G_L}(z)] = O(\varepsilon) . $$

The following Lemma is in [13]; to keep the paper self-contained, a short proof of
Lemma 6 is included.

Lemma 6. Let $Y$ be a random vector of length $p$ whose coordinates are iid with mean
0 and variance $1/p$. Let $Y'$ be an iid copy of $Y$. Assume that for $p$ large

$$ \mathbb{E}|f_1(Y^T Y', p) - f_2(Y^T Y', p)|^2 \le \varepsilon^2 / p . $$

Let $A_1$ and $A_2$ be generated from $f_1$ and $f_2$ using $n$ iid copies of $Y$. Then in the
large $p$ large $n$ limit it holds that

$$ \mathbb{E}|m_{A_1}(z) - m_{A_2}(z)| = O(\varepsilon) . $$

Proof. We largely follow [13]. Following the proof of (3.16), it is clear that

$$ \mathbb{E}|m_{A_1}(z) - m_{A_2}(z)|^2 \le \frac{C}{n} \sum_{i \ne j} \mathbb{E}\big[|f_1(X_i^T X_j, p) - f_2(X_i^T X_j, p)|^2\big] $$
$$ = C n^{-1} n(n - 1)\, \mathbb{E}\big[|f_1(Y^T Y', p) - f_2(Y^T Y', p)|^2\big] $$
$$ = O(\varepsilon^2) , $$

which implies the desired estimate. □

Thus, it suffices for (4.4) to show that $\mathbb{E}[m_{A_L}(z) - m_{G_L}(z)] = O(\varepsilon)$. This will
be proved in two steps. First, it will be shown in Section 4.3 that

(4.7) $$ \mathbb{E}[m_{A_L}(z) - m_{\widehat{G}_L}(z)] = O(\varepsilon) , $$

where $\widehat{G}_L$ is generated from $M_G$ using $k_L$ (as opposed to $K_L$, which was used to
generate $G_L$). Then in Section 4.4 it will be shown that

(4.8) $$ \mathbb{E}[m_{G_L}(z) - m_{\widehat{G}_L}(z)] \to 0 . $$

4.3. Proof of (4.7): Conversion from $A_L$ to $\widehat{G}_L$. The proof of (4.7) will follow
the strategy in [34]: the idea is to convert $A_L$ to $\widehat{G}_L$ in $np$ steps, where in each step one
entry in $M_A$ is replaced by the corresponding entry in $M_G$. It suffices to show that
in every step it holds that $\mathbb{E}[\Delta m(z)] = O(n^{-2}\varepsilon)$, where $\Delta m$ is the difference between
the Stieltjes transforms of the underlying matrices. This is the content of Lemma 7.
The generic setting for each step can be formulated as follows. Let $M[1]$ and
$M[2]$ denote two $p \times n$ random matrices that share the same entries except for the
$(i,j)$ position. Assume that these entries are independent, and their distributions
have mean 0, variance $1/p$, and higher moments bounded (with uniform constants)
by properly scaled powers of $p$. Let $A[1]$, $A[2]$ be generated from $M[1]$, $M[2]$ using
the kernel function $f_L$.
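The swapping scheme described above can be sketched in a few lines. The example below replaces the entries of one sample matrix by those of another, one at a time, and records the change in the Stieltjes transform at each of the $np$ steps; the per-step changes telescope to the total difference, which is why a bound of $O(n^{-2}\varepsilon)$ per step suffices. The envelope `f`, the normalization of the kernel matrix, and the function names are hypothetical choices for illustration only and are not the $f_L$ of the paper.

```python
import numpy as np

def kernel_matrix(M, f):
    """Zero-diagonal kernel matrix with entries f(X_i^T X_j), where X_i are the columns of M."""
    A = f(M.T @ M)
    np.fill_diagonal(A, 0.0)
    return A

def stieltjes(A, z):
    """Stieltjes transform of the empirical spectral distribution of a symmetric matrix."""
    return np.mean(1.0 / (np.linalg.eigvalsh(A) - z))

def swap_path(M_start, M_end, f, z):
    """Replace the entries of M_start by those of M_end one at a time and
    return the change of the Stieltjes transform at every step."""
    M = M_start.copy()
    prev = stieltjes(kernel_matrix(M, f), z)
    steps = []
    for i in range(M.shape[0]):
        for j in range(M.shape[1]):
            M[i, j] = M_end[i, j]           # swap a single entry
            cur = stieltjes(kernel_matrix(M, f), z)
            steps.append(cur - prev)
            prev = cur
    return np.array(steps)
```

Each step changes a single column of the sample matrix, so only one row and one column of the kernel matrix are affected; the proof of Lemma 7 below exploits exactly this low rank structure.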

Lemma 7. In the large $n$ large $p$ limit

$$ \mathbb{E}[m_{A[1]}(z) - m_{A[2]}(z)] = O(n^{-2}\varepsilon) . $$

For simplicity of notation, in the rest of this section $M[0]$ denotes the matrix
that shares the same entries with $M[2]$ except for the $(i,j)$ position, where
$M[0]_{ij} := 0$. Denote by $A[0]$ the kernel matrix generated from $M[0]$ using $f_L$.

Proof. Let $E$ be the event that $\|A[0] - A[m]\| \le \mathrm{Im}(z)/2$ for both $m = 1, 2$. In
Lemma 8 it will be shown that

(4.9) $$ \mathbb{E}\big[\|A[0] - A[m]\|^q\big] = O(n^{-q/2}) , \quad m = 1, 2 , $$

here $\|\cdot\|$ denotes the matrix norm. Since $q$ could be taken large ($q > 4$ suffices), it
follows that $E^c$ has probability $o(n^{-2})$. Clearly $m_{A[m]}(z) = O(1)$, thus it remains
to show

(4.10) $$ \mathbb{E}\big(1_E [m_{A[1]}(z) - m_{A[2]}(z)]\big) = O(n^{-2}\varepsilon) $$

in the large $n$ large $p$ limit.

Let $A$ denote $A[1]$ or $A[2]$ in the rest of the proof, and let $M \in \{M[1], M[2]\}$ be
the corresponding sample matrix. On $E$, expand

(4.11) $$ (A - zI)^{-1} = \sum_{k \ge 0} R_k , \quad \text{where} $$

$$ R_k := \big[(A[0] - zI)^{-1}(A[0] - A)\big]^k (A[0] - zI)^{-1} . $$

In particular, $R_0 = (A[0] - zI)^{-1}$. Note that any eigenvalue of $A[0] - zI$ is of the
form $\lambda - z$ for some eigenvalue $\lambda$ of $A[0]$. Since any such $\lambda$ is real, it follows that

(4.12) $$ \|R_0\| \le |\mathrm{Im}(z)|^{-1} . $$

Thus on $E$ the expansion (4.11) is absolutely convergent with respect to $\|\cdot\|$.

This expansion will be used to compute the leading asymptotics of $\mathbb{E}[1_E m_A(z)]$
as $n \to \infty$, and to show that modulo $o(n^{-2})$ one obtains the same leading asymptotics
for $A = A[1]$ and $A = A[2]$, and this clearly implies (4.10).

In (4.11), it is clear that the contribution to $\mathbb{E}[1_E m_A(z)]$ of $R_0$ is the same for
$A = A[1]$ and $A = A[2]$, so below only $R_k$ with $k \ge 1$ is considered.
Decay estimates for contribution of higher order terms in (4.11):

First, it will be shown that for some absolute constant $C$ the following holds for
$k \ge 1$:

(4.13) $$ |\mathrm{Tr}(R_k)| \le C\, \mathrm{Im}(z)^{-(k+1)}\, \|A[0] - A\|^k . $$

To show (4.13), the key observation is that $A[0] - A$ has rank at most 2. Indeed,
this follows from the fact that at most one column and at most one row in $A[0] - A$
could be nonzero. Thus, $[R_0(A[0] - A)]^k R_0$ is of rank at most 2 and therefore

$$ |\mathrm{Tr}(R_k)| \le 2\,\big\|[R_0(A[0] - A)]^k R_0\big\| $$

(4.14) $$ \le 2\,\|R_0\|^{k+1}\, \|A[0] - A\|^k , $$

and (4.13) follows immediately from (4.14) and (4.12).
As a consequence of (4.13), it follows that

$$ n^{-1}\, \mathbb{E}\Big[1_E \Big|\sum_{k \ge 3} \mathrm{Tr}\, R_k\Big|\Big] \le C n^{-1}\, \mathbb{E}\big[\|A[0] - A\|^3\big] = O(n^{-5/2}) . $$
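The expansion (4.11) and the rank-2 trace bound (4.14) are deterministic statements and can be checked numerically. The sketch below builds a symmetric matrix `A0`, a perturbation `D` supported on one row and one column (playing the role of $A[0] - A$) with $\|D\| = \mathrm{Im}(z)/2$, and compares the truncated series with the exact resolvent. The random test matrices and all function names are assumptions of this illustration.

```python
import numpy as np

def row_column_perturbation(n, j, target_norm, rng):
    """Symmetric matrix supported on row j and column j, rescaled to a given spectral norm."""
    D = np.zeros((n, n))
    v = rng.standard_normal(n)
    v[j] = 0.0                      # keep the diagonal entry equal to zero
    D[j, :] = v
    D[:, j] = v
    return D * (target_norm / np.linalg.norm(D, 2))

def neumann_terms(A0, D, z, num_terms):
    """Return [R_0, R_1, ...] with R_k = (R_0 D)^k R_0 and R_0 = (A0 - z I)^{-1}."""
    n = A0.shape[0]
    R0 = np.linalg.inv(A0 - z * np.eye(n))
    terms = [R0]
    for _ in range(num_terms - 1):
        terms.append(R0 @ D @ terms[-1])
    return terms

def check_expansion(n=30, j=3, z=0.2 + 0.7j, num_terms=80, seed=0):
    """Return (series error, rank of D, ||R_0||, whether the trace bounds hold for k = 1..5)."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n, n))
    A0 = (G + G.T) / np.sqrt(2 * n)
    D = row_column_perturbation(n, j, 0.5 * z.imag, rng)   # D stands for A[0] - A
    A = A0 - D
    terms = neumann_terms(A0, D, z, num_terms)
    exact = np.linalg.inv(A - z * np.eye(n))
    series_error = np.linalg.norm(sum(terms) - exact, 2)
    r0_norm = np.linalg.norm(terms[0], 2)
    d_norm = np.linalg.norm(D, 2)
    trace_ok = all(
        abs(np.trace(terms[k])) <= 2 * r0_norm ** (k + 1) * d_norm ** k + 1e-12
        for k in range(1, 6)
    )
    return series_error, np.linalg.matrix_rank(D), r0_norm, trace_ok
```

With $\|R_0 D\| \le 1/2$ the series converges geometrically, which mirrors the role of the event $E$ in the proof.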

Asymptotics matching for the contribution of $R_1$:
Rewrite

(4.15) $$ n^{-1}\, \mathbb{E}[1_E \mathrm{Tr}(R_1)] = n^{-1}\, \mathbb{E}[\mathrm{Tr}(R_1)] - n^{-1}\, \mathbb{E}[1_{E^c} \mathrm{Tr}(R_1)] . $$

We first show that the second term in (4.15) is $o(n^{-2})$. Indeed, it follows from
Cauchy-Schwarz that

$$ n^{-1}\, |\mathbb{E}[1_{E^c} \mathrm{Tr}(R_1)]| \le n^{-1} P(E^c)^{1/2} \big(\mathbb{E}[|\mathrm{Tr}\, R_1|^2]\big)^{1/2} $$
$$ \le C n^{-1} P(E^c)^{1/2} \big(\mathbb{E}[\|A[0] - A\|^2]\big)^{1/2} \quad (\text{using (4.13)}) , $$

which implies the desired estimate.

It remains to compute the asymptotics of the first term in (4.15). Let $\mathbb{E}_0$ denote
expectation with respect to entries of $M[0]$ and let $\mathbb{E}^{(i,j)}$ denote expectation with
respect to the $(i,j)$ entries of $M[1]$ and $M[2]$. Since $A[0]$ is independent of the $(i,j)$
entries, it follows that

$$ \mathbb{E}[\mathrm{Tr}(R_1)] = \mathrm{Tr}\, \mathbb{E}_0 \Big( R_0\, \mathbb{E}^{(i,j)}\big([A[0] - A]\big)\, R_0 \Big) . $$

It will be shown in Lemma 9 that there exists a decomposition

(4.16) $$ A[0] - A = a M_{ij} + b M_{ij}^2 + c $$

where $a$, $b$, $c$ are $n \times n$ matrices such that the following holds:

• The entries outside the $j$-th rows and the $j$-th columns of $a$, $b$, $c$ are zeros.

• The nonzero entries of $a$ and $b$ depend only on $f_L$ and $M[0]$.

• The 2nd moment of any entries of $a$ and $b$ are of size $O(n^{-1})$, and the 2nd
moment of any entries of $c$ are of size $O(n^{-4})$.

It follows that $\mathbb{E}[\|c\|^2] = O(n^{-3})$ and $c$ has rank 2, hence

$$ \mathrm{Tr}\big[\mathbb{E}_0 R_0 (\mathbb{E}^{(i,j)} c) R_0\big] = \mathrm{Tr}\, \mathbb{E}[R_0 c R_0] = O(\mathbb{E}\|c\|) = O(n^{-3/2}) . $$

It follows that

$$ n^{-1}\, \mathbb{E}[\mathrm{Tr}(R_1)] = n^{-1}\, \mathbb{E}_0\big(\mathrm{Tr}[R_0 a R_0]\big)\, \mathbb{E}^{(i,j)}[M_{ij}] + n^{-1}\, \mathbb{E}_0\big(\mathrm{Tr}[R_0 b R_0]\big)\, \mathbb{E}^{(i,j)}[M_{ij}^2] + O(n^{-5/2}) . $$

We remark that the first two terms are the same for $A = A[1]$ and $A = A[2]$.

Asymptotics matching for the contribution of $R_2$:
As before, rewrite

$$ n^{-1}\, \mathbb{E}[1_E \mathrm{Tr}(R_2)] = n^{-1}\, \mathbb{E}[\mathrm{Tr}(R_2)] - n^{-1}\, \mathbb{E}[1_{E^c} \mathrm{Tr}(R_2)] $$

and using (4.14) it is not hard to see that the second term is $o(n^{-2})$. Therefore it
remains to compute the asymptotics for the first term. Again, the decomposition
(4.16) will be used to expand

(4.17) $$ n^{-1}\, \mathbb{E}[\mathrm{Tr}(R_2)] = n^{-1}\, \mathbb{E}^{(i,j)}[M_{ij}^2]\, \mathbb{E}_0\big(\mathrm{Tr}[(R_0 a)^2 R_0]\big) + \text{other terms} . $$

Since the second moment of any entries of $a$ and $b$ are $O(n^{-1})$ and since $a$ and $b$
have $O(n)$ non-zero entries, it is not hard to see that $\mathbb{E}[\|a\|^2] = O(1)$ and $\mathbb{E}[\|b\|^2] =
O(1)$, while $\mathbb{E}[\|c\|^2] = O(n^{-3})$ as remarked above. Therefore using the small rank
properties of $a$, $b$, $c$, the other terms in the expansion (4.17) can be bounded by
expected values of products of spectral norms, and eventually one obtains an estimate of
$O(n^{-5/2})$. Clearly, the first term in the expansion (4.17) is the same for $A = A[1]$
and $A = A[2]$.

Finally, to complete the proof of Lemma 7 it remains to show (4.9) (see Lemma 8)
and (4.16) (see Lemma 9). □

To keep the following Lemmas self-contained, the symbol $H[k]$ will be used
instead of $A[k]$. In the applications of these Lemmas to obtain (4.9) and (4.16), these
are the same.

Lemma 8. For each $0 \le k \le 2$ let $H[k]$ be generated from the column vectors of
$M[k]$ using a kernel function $h(x,p)$ using the non-diagonal model (1.2). Assume
that

(4.18) $$ \Big|\frac{\partial}{\partial x} h(x,p)\Big| \le C_N (1 + |\sqrt{p}\, x|^N) , $$

for some $N > 0$ uniform over $p$ and $x$. Then for $m = 1, 2$ and $q > 0$ it holds that

$$ \mathbb{E}\big(\|H[0] - H[m]\|^q\big) = O_N(p^{-q/2}) . $$

Remark: To obtain (4.9), Lemma 8 will be applied for $h(x,p) = f_L(x,p)$, which
satisfies (4.18) thanks to Lemma 4 and Lemma 5. Note that in that case it holds
that $H[k] = A[k]$.

Proof. For simplicity, let $H = H[m]$ and $M = M[m]$ and $U = M[0]^T M[0]$. Since
$H[0] - H$ has at most $O(n)$ nonzero entries and since

$$ \|B\| \le \Big(\sum_{i,j} |B_{ij}|^2\Big)^{1/2} $$

for any square matrix $B$, it suffices to show that the $q$-th moment of each nonzero
entry of $H[0] - H$ is bounded above by $O(n^{-q})$.

Consider without loss of generality a (non-diagonal) entry on the $j$-th row of
$H[0] - H$. This entry has the form

$$ a_{jk} = h(U_{jk}, p) - h(U_{jk} + M_{ij} M_{ik}, p) $$

for some $1 \le k \le n$. It follows from (4.18) that

$$ |a_{jk}| = O_N\big(|M_{ij} M_{ik}|\, (1 + \sqrt{p}\,|M_{ij} M_{ik}| + \sqrt{p}\,|U_{jk}|)^N\big) . $$

Using (1.6), it is not hard to see that $\mathbb{E}[|\sqrt{p}\, U_{jk}|^q] = O(1)$ (for details see for
instance the proof of (3.8)). Also, it is clear that $\mathbb{E}|p M_{ij} M_{ik}|^q = O(1)$. Therefore,
using Cauchy-Schwarz it follows that

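$$ \mathbb{E}|a_{jk}|^q \le C_N \big(\mathbb{E}|M_{ij} M_{ik}|^{2q}\big)^{1/2} \Big(\mathbb{E}\big(1 + \sqrt{p}\,|M_{ij} M_{ik}| + \sqrt{p}\,|U_{jk}|\big)^{2Nq}\Big)^{1/2} = O_N(p^{-q}) . $$

□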



Lemma 9. For each $0 \le k \le 2$ let $H[k]$ be generated from the column vectors of
$M[k]$ using a kernel function $h(x,p)$ using the non-diagonal model (1.2). Assume
that for each $m = 1, 2, 3$ there is some $N > 0$ such that

(4.19) $$ \Big|\frac{\partial^m}{\partial x^m} h(x,p)\Big| \le C_N\, p^{(m-1)/2} (1 + |\sqrt{p}\, x|^N) , $$

where $C_N$ is uniform over $p$ and $x$. Then for $m = 1, 2$ it holds that

$$ H[m] - H[0] = a\, M[m]_{ij} + b\, M[m]_{ij}^2 + c , $$

where $a$, $b$, $c$ are $n \times n$ matrices such that

(i) the entries outside the $j$-th rows and columns of $a$, $b$, $c$ are zeros, and

(ii) the nonzero entries of $a$ and $b$ are independent of $M[m]_{ij}$, and

(iii) the 2nd moment of any entries of $a$ and $b$ are of size $O(n^{-1})$ and the 2nd
moment of any entries of $c$ are of size $O(n^{-4})$.

Remark: The condition (4.19) is satisfied when $h = f_L$ thanks to Lemma 4 and
Lemma 5. Thus, (4.16) follows by applying Lemma 9 to $h = f_L$; in that case it
holds that $H[k] = A[k]$.

Proof. Let $H = H[m]$ and $M = M[m]$ for $m \in \{1, 2\}$. Let $U = M[0]^T M[0]$. It is
clear that all nonzero entries of $H - H[0]$ must be in the $j$-th column or the $j$-th row
and off the diagonal, thus it remains to decompose these non-zero entries.

Consider without loss of generality an entry in the $j$-th row of $H - H[0]$, which
has the form

$$ \alpha_{jk} = h(U_{jk} + M_{ij} M[0]_{ik}, p) - h(U_{jk}, p) $$

for some $k \ne j$. We now decompose

$$ \alpha_{jk} = a_{jk} M_{ij} + b_{jk} M_{ij}^2 + c_{jk} , $$

$$ a_{jk} = M[0]_{ik}\, h_x(U_{jk}, p) , \quad b_{jk} = \frac{1}{2} M[0]_{ik}^2\, h_{xx}(U_{jk}, p) . $$

Similar decomposition for the $j$-th column of $A[0] - A$ also holds. It is clear that the
two conditions (i) and (ii) are satisfied and it remains to verify (iii).

By given assumption, for some $\theta$ in between $U_{jk}$ and $U_{jk} + M_{ij} M[0]_{ik}$ it holds that

$$ c_{jk} = \frac{1}{6}\, h_{xxx}(\theta, p)\, \big(M_{ij} M[0]_{ik}\big)^3 . $$

In particular, it follows from (4.19) that, for some $N > 0$ and $C$ depending on $N$,

$$ |c_{jk}| \le C p\, |M_{ij} M[0]_{ik}|^3 (1 + |\sqrt{p}\, \theta|)^N \le C p\, |M_{ij} M[0]_{ik}|^3 \big(1 + \sqrt{p}\,|U_{jk}| + \sqrt{p}\,|M_{ij} M[0]_{ik}|\big)^N . $$

Using Hölder's inequality and the previously obtained bounds on moments of entries
of $M$, $M[0]$, $U$, it follows that

$$ \mathbb{E}|c_{jk}|^2 = O(p^{-4}) . $$

□
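The decomposition in the proof of Lemma 9 is a second order Taylor expansion in the swapped entry with a cubic remainder. The sketch below splits $h(u + t) - h(u)$, with $t = M_{ij} M[0]_{ik}$, into the three pieces $a_{jk} M_{ij}$, $b_{jk} M_{ij}^2$ and $c_{jk}$. The test function $h = \sin$ (for which $|h'''| \le 1$) and the helper name `taylor_split` are hypothetical choices for this example, not the kernel $f_L$ of the paper.

```python
import numpy as np

def taylor_split(h, h1, h2, u, m_ij, m0_ik):
    """Split h(u + m_ij * m0_ik) - h(u) as a * m_ij + b * m_ij**2 + c,
    where a and b do not depend on m_ij and c is the Taylor remainder.
    h1 and h2 are the first and second derivatives of h."""
    a = m0_ik * h1(u)
    b = 0.5 * m0_ik ** 2 * h2(u)
    c = h(u + m_ij * m0_ik) - h(u) - a * m_ij - b * m_ij ** 2
    return a, b, c
```

For $h = \sin$ the remainder satisfies $|c| \le |t|^3/6$, the analogue of the bound on $c_{jk}$ obtained from (4.19) with $m = 3$.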



4.4. Proof of (4.8): Conversion from $\widehat{G}_L$ to $G_L$. Recall that $\xi_{G,p} = \sqrt{p}\, Y^T Y'$
where $Y$ and $Y'$ denote iid $1 \times p$ vectors with independent Gaussian $\mathcal{N}(0, 1/p)$
coordinates. Applying Lemma 6, it remains to show that: for any $\delta > 0$, the
following holds uniformly over $p$ large

(4.20) $$ \mathbb{E}\big[|k_L(\xi_{G,p}, p) - K_L(\xi_{G,p}, p)|^2\big] = O(\delta) , $$

here the implicit constant depends on $L$.

It follows from the triangle inequality and Lemma 4 that, for $p$ large,

$$ |P_{k,p}(x) - p_{k,p}(x)| \le C_k \delta (1 + |x|^k) . $$

On the other hand it is clear that $\mathbb{E}|\xi_p|^{2k} = O_k(1)$. It follows that, for $p$ large,

$$ \|P_{k,p} - p_{k,p}\|_{L^2(\mu_{G,p})} \le C_k \delta $$

for each $k = 0, \dots, L$. Now (4.20) follows since

$$ K_L(x,p) - k_L(x,p) = \sum_{k=1}^{L} a_{k,p} \big[P_{k,p}(x) - p_{k,p}(x)\big] $$

and for $p$ large $a_{k,p} = O(1)$ by Lemma 5.

5. Concluding remarks 

In Theorem 3 it was assumed that the entries of the $X_i$'s are i.i.d., while in
Theorem 1 and Theorem 2 it is possible to have random vectors with dependent entries,
as long as a high concentration condition is satisfied. Cheng and Singer [13] on the
other hand have outlined a proof of an analogue of Theorem 3 in the setting when the
$X_i$'s are independently and identically sampled from the unit sphere. This suggests that
the i.i.d. assumption on entries of the $X_i$'s may be weakened. Our proof of Theorem 3
in this paper however relies on the independence of entries of $X_i$, more specifically
in the implementation of the Lindeberg swapping argument in Section 4.3. It
would be interesting to see if this swapping argument could be improved to extend
Theorem 3 to settings when the $X_i$'s have dependent entries. In this direction, see for
instance Chatterjee [11] where some generalization of the Lindeberg principle was
considered.

In a different direction, one may ask questions about local statistics of the
eigenvalues of random kernel matrices. The local statistics of Wigner and covariance
matrices have been studied extensively in the literature, see e.g. [M] or [35] for a
comprehensive survey. However, we are not aware of any related work in the setting
of random kernel matrices, even when the envelope function is independent of $p$. It
seems that a naive adaptation of the approximation argument, carried out in this
paper and El Karoui's work [15], [16], [17], does not lead to sufficiently interesting
information about local statistics of the eigenvalues, unless very special assumptions
are made on the kernel. El Karoui [15], [16], [17] on the other hand has been able to
obtain some results about behavior of the largest eigenvalues via the approximation
approach.